Abstract

Nonnegative matrix factorization (NMF) is now a common tool for audio source separation. When
learning NMF on large audio databases, one major drawback is that the time complexity
is $O(FKN)$ when updating the dictionary (where $(F, N)$ is the dimension of the input power
spectrograms, and $K$ the number of basis spectra), which forbids its application to signals
longer than an hour. We provide an online algorithm with a complexity of $O(FK)$ in time and
memory for updates in the dictionary. We show on audio simulations that the online approach is
faster for short audio signals and makes it possible to analyze audio signals of several hours.



1 Introduction 

In audio source separation, nonnegative matrix factorization (NMF) is a popular technique for
building a high-level representation of complex audio signals with a small number of basis spectra,
forming together a dictionary (Smaragdis et al., 2007; Févotte et al., 2009; Virtanen et al., 2008).
Using the Itakura-Saito divergence as a measure of fit of the dictionary to the training set makes
it possible to capture fine structure in the power spectrum of audio signals, as shown in
(Févotte et al., 2009).

However, estimating the dictionary can be quite slow for long audio signals, and indeed
intractable for training sets of more than a few hours. We propose an algorithm to estimate
Itakura-Saito NMF (IS-NMF) on audio signals of possibly infinite duration with tractable memory and
time complexity. This article is organized as follows: in Section 2 we summarize Itakura-Saito
NMF, propose an algorithm for online NMF, and discuss implementation details. In Section 3 we
test our algorithms on real audio signals of short, medium and long durations. We show
that our approach outperforms regular batch NMF in terms of computer time.
2 Online Dictionary Learning for the Itakura-Saito divergence



Various methods were recently proposed for online dictionary learning (Mairal et al., 2010;
Hoffman et al., 2010; Bucak and Gunsel, 2009). However, to the best of our knowledge, no algorithm
exists for online dictionary learning with the Itakura-Saito divergence. In this section we
summarize IS-NMF, then introduce our algorithm for online NMF and briefly explain the mathematical
framework.



2.1 Itakura-Saito NMF 

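Define the Itakura-Saito divergence as $d_{IS}(y, x) = \sum_f \left( \frac{y_f}{x_f} - \log \frac{y_f}{x_f} - 1 \right)$.
Given a data set $V = (v_1, \dots, v_N) \in \mathbb{R}_+^{F \times N}$, Itakura-Saito NMF consists in finding
$W \in \mathbb{R}_+^{F \times K}$ and $H = (h_1, \dots, h_N) \in \mathbb{R}_+^{K \times N}$ that minimize the
following objective function:

$$\mathcal{L}_H(W) = \frac{1}{N} \sum_{n=1}^{N} d_{IS}(v_n, W h_n). \tag{1}$$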
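As a concrete illustration of Eq. (1), the following NumPy sketch evaluates the Itakura-Saito divergence and the averaged objective on synthetic data. The function names `is_divergence` and `objective`, the matrix sizes and the gamma-distributed perturbation are illustrative assumptions, not part of the paper; all inputs are assumed strictly positive.

```python
import numpy as np

def is_divergence(y, x):
    """Itakura-Saito divergence d_IS(y, x), summed over all entries."""
    ratio = y / x
    return float(np.sum(ratio - np.log(ratio) - 1.0))

def objective(V, W, H):
    """Eq. (1): average over the N columns of d_IS(v_n, W h_n)."""
    N = V.shape[1]
    return is_divergence(V, W @ H) / N

rng = np.random.default_rng(0)
F, K, N = 6, 3, 20
W = rng.random((F, K)) + 0.1
H = rng.random((K, N)) + 0.1
V = W @ H                                   # data generated exactly by the model
V_noisy = V * rng.gamma(shape=5.0, scale=0.2, size=V.shape)

print(objective(V, W, H))                   # zero: perfect fit
print(objective(V_noisy, W, H))             # strictly positive
```

The divergence depends only on the ratio $y_f / x_f$, so it is unchanged when data and model are multiplied by the same constant.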



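The standard approach to solving IS-NMF is to optimize alternately in $W$ and $H$ and use
majorization-minimization (Févotte and Idier, in press). At each step, the objective function is
replaced by an auxiliary function of the form $\mathcal{G}_H(W, \bar{W})$ such that
$\mathcal{L}_H(W) \le \mathcal{G}_H(W, \bar{W})$, with equality if $W = \bar{W}$:

$$\mathcal{G}_H(W, \bar{W}) = \sum_{f} \sum_{k} \left( A_{fk} \frac{1}{W_{fk}} + B_{fk} W_{fk} \right) + c, \tag{2}$$

where $A, B \in \mathbb{R}_+^{F \times K}$ and $c \in \mathbb{R}$ do not depend on $W$. Up to the constant
factor $1/N$ of Eq. (1), which does not affect the minimizer, $A$ and $B$ are given by:

$$A_{fk} = \sum_{n=1}^{N} H_{kn} V_{fn} (\bar{W} H)_{fn}^{-2} \bar{W}_{fk}^{2}, \qquad B_{fk} = \sum_{n=1}^{N} H_{kn} (\bar{W} H)_{fn}^{-1}. \tag{3}$$

Thus, updating $W$ by $W_{fk} = \sqrt{A_{fk} / B_{fk}}$, which minimizes Eq. (2) over $W$, yields a
descent algorithm. Similar updates can be found for $h_n$, so that the whole process defines a
descent algorithm in $(W, H)$ (for more details see, e.g., Févotte and Idier, in press). In a
nutshell, batch IS-NMF works in cycles: at each cycle, all sample points are visited, the whole
matrix $H$ is updated, the auxiliary function in Eq. (2) is re-computed, and $W$ is then updated.
We now turn to the description of online NMF.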
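The batch dictionary update can be sketched in NumPy as follows. This is a minimal illustration, assuming that $A$ and $B$ are the sums over all $N$ samples described above and that $W$ is set to the entrywise square root of $A / B$, which is the minimizer of an expression of the form $A/W + BW$. The names `batch_dictionary_update` and `is_objective` and the random test data are ours, not the paper's.

```python
import numpy as np

def is_objective(V, W, H):
    """Average Itakura-Saito divergence between V and W @ H."""
    R = V / (W @ H)
    return float(np.sum(R - np.log(R) - 1.0)) / V.shape[1]

def batch_dictionary_update(V, W, H):
    """One majorization-minimization step on W with H held fixed."""
    WH = W @ H
    A = ((V / WH**2) @ H.T) * W**2      # A_fk = sum_n H_kn V_fn (WH)_fn^-2 W_fk^2
    B = (1.0 / WH) @ H.T                # B_fk = sum_n H_kn (WH)_fn^-1
    return np.sqrt(A / B)

rng = np.random.default_rng(0)
F, K, N = 8, 3, 50
V = rng.random((F, N)) + 0.1
W = rng.random((F, K)) + 0.1
H = rng.random((K, N)) + 0.1

values = [is_objective(V, W, H)]
for _ in range(20):
    W = batch_dictionary_update(V, W, H)
    values.append(is_objective(V, W, H))
```

Each call costs two matrix products over all $N$ columns, which is where the $O(FKN)$ cost of the batch dictionary step comes from.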



2.2 An online algorithm for online NMF 

When $N$ is large, multiplicative update algorithms for IS-NMF become expensive because the
dictionary update step involves large matrix multiplications with time complexity in
$O(FKN)$ (computation of matrices $A$ and $B$). We present here an online version of the classical
multiplicative update algorithm, in the sense that only a subset of the training data is used at
each step of the algorithm.

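Suppose that at each iteration of the algorithm we are provided a new data point $v_t$, and we
are able to find $h_t$ that minimizes $d_{IS}(v_t, W^{(t-1)} h)$ over $h$. Let us rewrite the updates
in Eq. (3). Initialize $A^{(0)}$, $B^{(0)}$, $W^{(0)}$ and at each step compute:

$$A^{(t)} = A^{(t-1)} + \left( \frac{v_t}{(W^{(t-1)} h_t)^2} h_t^\top \right) \cdot (W^{(t-1)})^2, \qquad B^{(t)} = B^{(t-1)} + \frac{1}{W^{(t-1)} h_t} h_t^\top, \qquad W^{(t)} = \sqrt{\frac{A^{(t)}}{B^{(t)}}}, \tag{4}$$

where the product $\cdot$, the divisions, the squares and the square root are taken entrywise, and
$u h_t^\top$ denotes the outer product of an $F$-vector $u$ with $h_t$.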
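The recursion of Eq. (4) can be sketched as below. This is an illustrative reading, assuming each sample adds one outer-product term to $A$ and to $B$ and that $W$ is then refreshed as the entrywise square root of $A / B$; the function name `online_step` and the test data are ours. As a sanity check, feeding all samples while holding the dictionary fixed should reproduce the batch sums of Eq. (3).

```python
import numpy as np

def online_step(A, B, W, v, h):
    """Add the contribution of one sample (v, h) to A and B, then refresh W."""
    wh = W @ h                                   # model spectrum, shape (F,)
    A = A + np.outer(v / wh**2, h) * W**2
    B = B + np.outer(1.0 / wh, h)
    return A, B, np.sqrt(A / B)

rng = np.random.default_rng(0)
F, K, N = 8, 3, 40
V = rng.random((F, N)) + 0.1
H = rng.random((K, N)) + 0.1
W0 = rng.random((F, K)) + 0.1

# Accumulate all samples with the dictionary frozen at W0.
A = np.zeros((F, K))
B = np.zeros((F, K))
for n in range(N):
    A, B, _ = online_step(A, B, W0, V[:, n], H[:, n])
```

Each call touches only $F \times K$ arrays, so its cost in time and memory is $O(FK)$, independent of the number of samples seen so far.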

Now we may update $W$ each time a new data point $v_t$ is visited, instead of visiting the whole
data set. This differs from batch NMF in the following sense: suppose we replace the objective
function in Eq. (1) by

$$\mathcal{L}_T(W) = \frac{1}{T} \sum_{t=1}^{T} d_{IS}(v_t, W h_t), \tag{5}$$

where $(v_1, v_2, \dots, v_t, \dots)$ is an infinite sequence of data points, and the sequence
$(h_1, \dots, h_t, \dots)$ is such that $h_t$ minimizes $d_{IS}(v_t, W^{(t-1)} h)$. Then we may show that
the modified sequence of updates corresponds to minimizing the following auxiliary function:

$$\hat{\mathcal{L}}_T(W) = \sum_{k} \sum_{f} \left( A^{(T)}_{fk} \frac{1}{W_{fk}} + B^{(T)}_{fk} W_{fk} \right) + c.$$

If $T$ is fixed, this problem is exactly equivalent to IS-NMF on a finite training set. Whereas in the
batch algorithm described in Section 2.1 all of $H$ is updated once and then all of $W$, in online
NMF each new $h_t$ is estimated exactly and then $W$ is updated once. Another way to see it is that in
standard NMF, the auxiliary function is updated at each pass through the whole data set from
the most recent updates in $H$, whereas in online NMF, the auxiliary function takes into account
all updates starting from the first one.



Extensions. Prior information on $H$ or $W$ is often useful for imposing structure in the
factorization (Lefèvre et al., 2011; Virtanen, 2007; Smaragdis et al., 2007). Our framework for
online NMF easily accommodates penalties such as:

• Penalties depending on the dictionary $W$ only.

• Penalties on $H$ that are decomposable and expressed in terms of a concave increasing function $\psi$
(Lefèvre et al., 2011): $\Psi(H) = \sum_{n=1}^{N} \psi\left( \sum_k H_{kn} \right)$.



2.3 Practical online NMF 

We have described a pure version of online NMF. We now discuss several extensions
that are commonly used in online algorithms and allow for considerable gains in speed.
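Algorithm 1 Online Algorithm for IS-NMF

Input: training set, $W^{(0)}$, $A^{(0)}$, $B^{(0)}$, $\rho$, $\beta$, $\eta$, $\varepsilon$.

$t \leftarrow 0$

repeat

    $t \leftarrow t + 1$
    draw $v_t$ from the training set.
    $h_t \leftarrow \arg\min_h d_{IS}(\varepsilon + v_t, \varepsilon + W h)$
    $a^{(t)} \leftarrow \left( \frac{\varepsilon + v_t}{(\varepsilon + W h_t)^2} h_t^\top \right) \cdot W^2$
    $b^{(t)} \leftarrow \frac{1}{\varepsilon + W h_t} h_t^\top$
    if $t \equiv 0 \; [\beta]$
        $A^{(t)} \leftarrow \rho A^{(t-\beta)} + \sum_{s=t-\beta+1}^{t} a^{(s)}$
        $B^{(t)} \leftarrow \rho B^{(t-\beta)} + \sum_{s=t-\beta+1}^{t} b^{(s)}$
        $W^{(t)} \leftarrow \sqrt{A^{(t)} / B^{(t)}}$
        for $k = 1 \dots K$
            $s \leftarrow \sum_f W_{fk}$, $W_{fk} \leftarrow W_{fk} / s$
            $A_{fk} \leftarrow A_{fk} / s$, $B_{fk} \leftarrow B_{fk} \times s$
        end for
    end if

until $\| W^{(t)} - W^{(t-1)} \|_F^2 < \eta$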
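The steps of Algorithm 1 can be put together in a short NumPy sketch. It is a hedged illustration rather than the authors' implementation, and it makes several assumptions that are ours: the statistics $A$ and $B$ start at zero, the factor $\rho = r^{\beta/N}$ multiplies the accumulated statistics at each dictionary update, the inner loop uses fresh restarts with multiplicative updates, the stopping test is evaluated at dictionary updates only, and the names `solve_h`, `online_is_nmf`, `n_passes` as well as the default values are hypothetical.

```python
import numpy as np

def solve_h(v, W, eps, n_iter=100, rng=None):
    """Inner loop: approximately minimize d_IS(eps + v, eps + W h) over h >= 0."""
    rng = np.random.default_rng() if rng is None else rng
    h = rng.random(W.shape[1]) + 0.1             # fresh restart from a random point
    for _ in range(n_iter):
        x = eps + W @ h
        h = h * np.sqrt((W.T @ ((eps + v) / x**2)) / (W.T @ (1.0 / x)))
    return h

def online_is_nmf(V, K, beta=10, r=0.7, eps=1e-12, eta=1e-10, n_passes=3, seed=0):
    rng = np.random.default_rng(seed)
    F, N = V.shape
    rho = r ** (beta / N)                        # scaling of past statistics (assumed)
    W = V[:, rng.choice(N, size=K, replace=False)] + eps   # init from random samples
    W = W / W.sum(axis=0)
    A = np.zeros((F, K))                         # assumed zero initial statistics
    B = np.zeros((F, K))
    a_sum = np.zeros((F, K))
    b_sum = np.zeros((F, K))
    t = 0
    for _ in range(n_passes):
        for n in rng.permutation(N):             # reshuffle the samples at each cycle
            t += 1
            v = V[:, n]
            h = solve_h(v, W, eps, rng=rng)
            x = eps + W @ h
            a_sum += np.outer((eps + v) / x**2, h) * W**2
            b_sum += np.outer(1.0 / x, h)
            if t % beta == 0:                    # dictionary update every beta samples
                A = rho * A + a_sum
                B = rho * B + b_sum
                W_new = np.sqrt(A / B)
                s = W_new.sum(axis=0)            # rescale columns to unit l1 norm
                W_new, A, B = W_new / s, A / s, B * s
                a_sum[:] = 0.0
                b_sum[:] = 0.0
                converged = np.sum((W_new - W) ** 2) < eta
                W = W_new
                if converged:
                    return W
    return W

rng = np.random.default_rng(1)
V = (rng.random((16, 3)) + 0.05) @ (rng.random((3, 60)) + 0.05)
W = online_is_nmf(V, K=3, beta=10, r=0.7)
```

Only the $F \times K$ arrays `A`, `B`, `a_sum`, `b_sum` and `W` persist between samples, which reflects the constant memory footprint of the online scheme with fresh restarts.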



Finite data sets. When working on finite training sets, we cycle over the training set several 
times, and randomly permute the samples at each cycle. 

Sampling method for infinite data sets. When dealing with large (or infinite) training
sets, samples may be drawn in batches and then permuted at random to avoid local correlations
of the input.

Fresh or warm restarts. Minimizing $d_{IS}(v_t, W h_t)$ is an inner loop in our algorithm. Finding
an exact solution $h_t$ for each new sample may be costly (a rule of thumb is 100 iterations from a
random point). A shortcut is to stop the inner loop before convergence. This amounts to computing
only an upper bound of $\min_h d_{IS}(v_t, W h)$. Another shortcut is to warm restart the inner loop,
at the cost of keeping all the most recent regression weights $H = (h_1, \dots, h_N)$ in memory. For
small data sets, this makes it possible to run online NMF very similarly to batch NMF: each time a
sample is visited, $h_t$ is updated only once, and then $W$ is updated. When using warm restarts, the
time complexity of the algorithm is not changed, but the memory requirements become $O((F + N)K)$.
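The difference between fresh and warm restarts can be illustrated with a small sketch of the inner loop. It assumes multiplicative (majorization-minimization) updates on $h$ of the same square-root form as the dictionary update, which never increase the divergence; the names `d_is` and `update_h` and the random data are ours.

```python
import numpy as np

def d_is(y, x):
    """Itakura-Saito divergence between two positive vectors."""
    r = y / x
    return float(np.sum(r - np.log(r) - 1.0))

def update_h(v, W, h, n_iter=1):
    """Run n_iter multiplicative updates of h for d_IS(v, W h), starting from h."""
    for _ in range(n_iter):
        x = W @ h
        h = h * np.sqrt((W.T @ (v / x**2)) / (W.T @ (1.0 / x)))
    return h

rng = np.random.default_rng(0)
F, K = 12, 4
W = rng.random((F, K)) + 0.1
v = rng.random(F) + 0.1

h_random = rng.random(K) + 0.1                    # fresh restart: random starting point
h_fresh = update_h(v, W, h_random, n_iter=100)    # rule of thumb: 100 iterations
h_warm = update_h(v, W, h_fresh, n_iter=1)        # warm restart: one update from a stored h
```

A warm restart requires the previous `h` of each sample to be kept, which is the source of the extra $O(NK)$ memory.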

Mini-batch. Updating $W$ every time a sample is drawn costs $O(FK)$: as shown in simulations,
we may save some time by updating $W$ only every $\beta$ samples, i.e., draw samples in batches and
then update $W$. This is also meant to stabilize the updates.
Scaling past data. In order to speed up the online algorithm it is possible to scale past
information so that newer information is given more importance:

$$A^{(t+\beta)} = \rho A^{(t)} + \sum_{s=t+1}^{t+\beta} a^{(s)}, \qquad B^{(t+\beta)} = \rho B^{(t)} + \sum_{s=t+1}^{t+\beta} b^{(s)},$$

where $a^{(s)}$ and $b^{(s)}$ denote the contributions of sample $s$ to $A$ and $B$ in Eq. (4), and where
we choose $\rho = r^{\beta/N}$. We choose this particular form so that when $N \to +\infty$, $\rho \to 1$.
Moreover, $r$ is raised to a power proportional to $\beta$ so that we can compare performance for
several batch sizes and the same parameter $r$. In principle this rescaling of past information
amounts to discounting samples at a geometric rate: in the objective function of Eq. (5), a sample
that entered the statistics $m$ dictionary updates ago is weighted by $\rho^m$.
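A short numerical check helps to see why the exponent involves the batch size. The sketch below assumes that $\rho = r^{\beta/N}$ multiplies the accumulated statistics once per dictionary update; the function name `forgetting_factor` and the chosen values of `N` and `beta` are illustrative.

```python
def forgetting_factor(r, beta, N):
    """Per-update scaling rho = r**(beta / N) applied to the past statistics."""
    return r ** (beta / N)

N, r = 10_000, 0.7
decay_per_pass = {}
for beta in (1, 10, 100, 1000):
    rho = forgetting_factor(r, beta, N)
    updates_per_pass = N // beta
    # After one full pass over the data, old statistics have been scaled by
    # rho ** updates_per_pass, which stays close to r for every batch size.
    decay_per_pass[beta] = rho ** updates_per_pass
    print(beta, rho, decay_per_pass[beta])
```

With this parameterization, the same value of $r$ means the same amount of forgetting per pass over the training set, whatever the mini-batch size.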

Rescaling W. In order to avoid the scaling ambiguity, each time $W$ is updated, we rescale
$W^{(t)}$ so that its columns have unit norm. $A^{(t)}$ and $B^{(t)}$ must be rescaled accordingly (as well
as $H$ when using warm restarts). This does not change the result and avoids numerical instabilities
when computing the product $WH$.
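The rescaling step can be checked numerically. The sketch below assumes the columns of $W$ are normalized to unit $\ell_1$ norm (sum over frequencies) and that $W$ equals the entrywise square root of $A / B$; the function name `rescale` and the random data are ours.

```python
import numpy as np

def rescale(W, A, B, H=None):
    """Give each column of W unit l1 norm and rescale A, B (and H) to match."""
    s = W.sum(axis=0)                 # one scale per basis spectrum
    W, A, B = W / s, A / s, B * s
    if H is not None:
        H = H * s[:, None]            # compensates so that W @ H is unchanged
    return W, A, B, H

rng = np.random.default_rng(0)
F, K, N = 6, 3, 10
A = rng.random((F, K)) + 0.1
B = rng.random((F, K)) + 0.1
W = np.sqrt(A / B)
H = rng.random((K, N)) + 0.1

W2, A2, B2, H2 = rescale(W, A, B, H)
```

Dividing $A$ by the scale and multiplying $B$ by it keeps the relation between $W$, $A$ and $B$ intact, so the next dictionary update starts from consistent statistics.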

Dealing with small amplitude values. The Itakura-Saito divergence $d_{IS}(y, x)$ is badly
behaved when either $y = 0$ or $x = 0$. As a remedy we replace it in our algorithm by
$d_{IS}(\varepsilon + y, \varepsilon + x)$. The updates were modified accordingly in Algorithm 1.
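The effect of the offset can be seen on a toy example containing exact zeros. This is a sketch only: the function name `d_is_eps` and the default value `eps=1e-12` are assumptions made for the illustration, not values taken from the paper.

```python
import numpy as np

def d_is_eps(y, x, eps=1e-12):
    """Regularized divergence d_IS(eps + y, eps + x), summed over entries."""
    r = (eps + y) / (eps + x)
    return float(np.sum(r - np.log(r) - 1.0))

y = np.array([0.0, 1.0, 2.0])     # a zero in the data
x = np.array([0.5, 0.0, 2.0])     # a zero in the model
value = d_is_eps(y, x)            # finite, whereas the plain divergence is not
```

Without the offset, the first entry would involve $\log 0$ and the second a division by zero.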

Overview. Algorithm 1 summarizes our procedure. The two parameters of interest are the
mini-batch size $\beta$ and the forgetting factor $r$. Note that when $\beta = N$ and $r = 0$, the online
algorithm is equivalent to the batch algorithm.

3 Experimental study 

In this section we validate the online algorithm and compare it with its batch counterpart. A
natural criterion is to train both on the same data with the same initial parameters $W^{(0)}$ (and
$H^{(0)}$ when applicable) and compare their respective fit to a held-out test set, as a function of
the computer time available for learning. The input data are power spectrograms extracted from
single-channel audio tracks, with analysis windows of 512 samples and 256 samples overlap. All
silent frames were discarded.
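The preprocessing described above can be sketched with NumPy alone. The window length of 512 samples and the overlap of 256 samples (hence a hop of 256) come from the text; the Hann window, the zero-energy silence criterion, the sampling rate and the function name `power_spectrogram` are assumptions made for the illustration.

```python
import numpy as np

def power_spectrogram(signal, win_len=512, hop=256, silence_tol=0.0):
    """Frame the signal, window it, and return |FFT|^2 with silent frames removed."""
    n_frames = 1 + (len(signal) - win_len) // hop
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(win_len)       # Hann window (assumed)
    V = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # shape (n_frames, F)
    keep = V.sum(axis=1) > silence_tol               # discard silent frames
    return V[keep].T                                 # shape (F, N)

sr = 16000                                           # hypothetical sampling rate
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440.0 * t)
signal[4000:8000] = 0.0                              # a stretch of digital silence
V = power_spectrogram(signal)
```

The result has $F = 257$ frequency bins for a 512-sample window, and its columns play the role of the data points $v_n$.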

We make the comparison for small, medium, and large audio tracks (resp. 10^, 10^,10^ time 
windows). $W$ is initialized with random samples from the training set. For each process, several
seeds were tried, and the best seed (in terms of objective function value) is shown.
Finally, we use e = 10~^^ which is well below the hearing threshold. 

Small data set (30 seconds). We ran online NMF with warm restarts and one update of $h$
every sample. From Figure 1 we can see that there is a restriction on the values of $(\beta, r)$ that we
can use: if $r < 1$ then $\beta$ should be chosen larger than 1. On the other hand, as long as $r > 0.5$,
the stability of the algorithm is not affected by the value of $\beta$. In terms of speed, clearly setting
$r < 1$ is crucial for the online algorithm to compete with its batch counterpart. Then there is a
tradeoff to make in $\beta$: it should be picked larger than 1 to avoid instabilities, and smaller than
the size of the training set for faster learning (this was also shown in (Mairal et al., 2010) for
the square loss).

Figure 1: Comparison of online and batch algorithm on a thirty-second-long audio track.



Medium data set (4 minutes). We ran online NMF with warm restarts and one update
of $h$ every sample. The same remarks apply as before. Moreover, we can see in Figure 2 that the
online algorithm outperforms its batch counterpart by several orders of magnitude in terms of
computer time for a wide range of parameter values.

Figure 2: Comparison of online and batch algorithm on a three-minute-long audio track.

Large data set (1 hour 20 minutes). For the large data set, we use fresh restarts and 100
updates of $h$ for every sample. Since batch NMF does not fit into memory any more, we compare
online NMF with batch NMF learnt on a subset of the training set. In Figure 3 we see that
running online NMF on the whole training set yields a more accurate dictionary in a fraction of
the time that batch NMF takes to run on a subset of the training set. We stress the fact that we
used fresh restarts so that there is no need to store $H$ offline.

[Figure 3 plot. Horizontal axis: CPU time. Curves: online NMF with $r = 0.70$ and $\beta = 1.0\mathrm{e}{+}03$, and batch NMF.]

Figure 3: Comparison of online and batch algorithm on an album of Django Reinhardt (1 hour 20
minutes).



Summary. The online algorithm we proposed is stable under minimal restrictions on the
values of the parameters $(r, \beta)$: if $r = 1$, then any value of $\beta$ is stable. If $r < 1$, then $\beta$
should be chosen large enough. Clearly there is a tradeoff in choosing the mini-batch size $\beta$,
which is explained by the way it works: when $\beta$ is small, frequent updates of $W$ are an additional
cost as compared with batch NMF. On the other hand, when $\beta$ is small enough we take advantage of
the redundancy in the training set. From our experiments we find that choosing $r = 0.7$ and
$\beta = 10^3$ yields satisfactory performance.

4 Conclusion 

In this paper we make several contributions: we provide an algorithm for online IS-NMF with
a complexity of $O(FK)$ in time and memory for updates in the dictionary. We propose several
extensions that speed up online NMF and summarize them in a concise algorithm (code
will be made available soon). We show that online NMF competes with its batch counterpart on
small data sets, while on large data sets it outperforms it by several orders of magnitude. In a
pure online setting, data samples are processed only once, with constant time and memory cost.
Thus, online NMF algorithms may be run on data sets of potentially infinite size, which opens up
many possibilities for audio source separation.